arXiv:1503.07439v2 [cs.SI] 3 Apr 2015 


Overlapping Community Detection Using 
Neighborhood-Inflated Seed Expansion 

Joyce Jiyoung Whang*, David F. Gleich†, and Inderjit S. Dhillon*

*Dept. of Computer Science, University of Texas at Austin, {Joyce,inderjit}@cs.utexas.edu
†Dept. of Computer Science, Purdue University, dgleich@purdue.edu


Abstract 

Community detection is an important task in network analysis. A community (also referred to 
as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. 

In many social and information networks, these communities naturally overlap. For instance, in 
a social network, each vertex in a graph corresponds to an individual who usually participates 
in multiple communities. In this paper, we propose an efficient overlapping community detection 
algorithm using a seed expansion approach. The key idea of our algorithm is to find good seeds, 
and then greedily expand these seeds based on a community metric. Within this seed expansion 
method, we investigate the problem of how to determine good seed nodes in a graph. In particular, 
we develop new seeding strategies for a personalized PageRank clustering scheme that optimizes 
the conductance community score. Experimental results show that our seed expansion algorithm 
outperforms other state-of-the-art overlapping community detection methods in terms of producing 
cohesive clusters and identifying ground-truth communities. We also show that our new seeding 
strategies are better than existing strategies, and are thus effective in finding good overlapping 
communities in real-world networks. 

Index Terms — Community Detection, Clustering, Overlapping Communities, Seed Expansion, 
Seeds, Personalized PageRank. 


1 Introduction 

Community detection is one of the most important and fundamental tasks in network analysis with 
applications in functional prediction in biology [18] and sub-market identification [5] among others. 
Given a network, a community is defined to be a set of cohesive nodes that have more connections inside 
the set than outside. Since a network can be modelled as a graph with vertices and edges, community
detection can be thought of as a graph clustering problem where each community corresponds to a cluster
in the graph. In this manuscript, the terms cluster and community are used interchangeably. 

The goal of traditional, exhaustive graph clustering algorithms (e.g., Metis m, Graclus [10]) is to
partition a graph such that every node belongs to exactly one cluster. However, in many social and 
information networks, nodes participate in multiple communities. For instance, in a social network, 
nodes represent individuals and edges represent social interactions between the individuals. In this 
setting, a node’s communities can be interpreted as its social circles. Thus, it is likely that a node 
belongs to multiple communities, i.e., communities naturally overlap. To find these groups, we study 
the problem of overlapping community detection where communities are allowed to overlap with each 
other and some nodes are allowed not to belong to any cluster. 

In this paper, we propose an efficient overlapping community detection algorithm using a seed 
expansion approach. More specifically, we investigate how to select good seeds in a method that
grows communities around seeds. These local expansion methods are among the most successful
strategies for overlapping community detection [32]. However, principled methods to choose the seeds
are few and far between. When they exist, they are usually computationally expensive, for instance,
using maximal cliques as seeds m. Empirically successful strategies include exhaustively exploring
all individual seeds and greedy methods that randomly pick a vertex, grow a cluster, and continue 
with any unassigned vertex. 

To find a set of good seeds, we present two effective seeding strategies which we call “Graclus 
centers” and “Spread hubs.” The “Graclus centers” seeding is based on the same distance kernel that 
underlies the equivalence between kernel k-means and graph clustering objectives m. Using this
distance function, we can efficiently locate a good seed within an existing set of cohesive vertices of
the graph. Specifically, we first compute many clusters using a multi-level weighted kernel k-means
algorithm on the graph (the Graclus algorithm) [10], then use the corresponding distance function
to compute the “centroid vertex” of each cluster. We use the neighborhood set of each centroid 
vertex as a seed region for community detection. The idea of “Spread hubs” seeding is to select an 
independent set of high degree vertices. This seeding strategy is inspired by the recent observations 
that there should be good clusters around high degree vertices in many real-world networks which 
have a power-law degree distribution m, [E]. 

The algorithm we use to grow a seed set is based on personalized PageRank (PPR) clustering [1].
The high level idea of this expansion method is to first compute the PPR vector for each of the seeds, 
and then expand each seed based on the PPR score. It is important to note that we can have multiple 
nodes in the personalization vector, and indeed we use the entire vertex neighborhood of a seed node 
as the personalization vector for PPR. This neighborhood inflation plays a critical role in the success 
of our algorithm. The full algorithm to compute overlapping clusters from the seeds is discussed in
Section 3. We name our algorithm nise by abbreviating our main idea, Neighborhood-Inflated Seed
Expansion. 

Our experimental results show that our seeding strategies are better than existing seeding
strategies, and effective in finding good overlapping communities in real-world networks. More importantly,
we observe that nise significantly outperforms other state-of-the-art overlapping community detection 
methods in terms of producing cohesive clusters and identifying ground-truth communities. Also, our 
method scales to problems with over 45 million edges, whereas other existing methods were unable to 
complete on these large datasets. 

2 Preliminaries 

In this section, we formally describe the overlapping community detection problem, and review some 
important concepts in graph clustering. Also, we introduce real-world networks which are used in our 
experiments. 

2.1 Problem Statement 

Given a graph $G = (V, E)$ with a vertex set $V$ and an edge set $E$, we can represent the graph as
an adjacency matrix $A$ such that $A_{ij} = e_{ij}$ where $e_{ij}$ is the edge weight between vertices $i$ and $j$,
or $A_{ij} = 0$ if there is no edge. We assume that graphs are undirected, i.e., $A$ is symmetric. The
goal of the traditional, exhaustive graph clustering problem is to partition a graph into $k$ pairwise
disjoint clusters $C_1, \dots, C_k$ such that $C_1 \cup \dots \cup C_k = V$. On the other hand, the goal of the overlapping
community detection problem is to find overlapping clusters whose union is not necessarily equal to
the entire vertex set $V$. Formally, we seek $k$ overlapping clusters such that $C_1 \cup \dots \cup C_k \subseteq V$.

2.2 Measures of Cluster Quality


There are some popular measures for gauging the quality of clusters: cut, normalized cut, and
conductance. Let us define $\text{links}(C_p, C_q)$ to be the sum of edge weights between vertex sets $C_p$ and
$C_q$.

Cut. The cut of cluster $C_i$ is defined as the sum of edge weights between $C_i$ and its complement,
$V \setminus C_i$. That is,

\[ \text{cut}(C_i) = \text{links}(C_i, V \setminus C_i). \tag{1} \]

Normalized Cut. The normalized cut of a cluster is defined by the cut with volume normalization
as follows:

\[ \text{ncut}(C_i) = \frac{\text{cut}(C_i)}{\text{links}(C_i, V)}. \tag{2} \]

Conductance. The conductance of a cluster is defined to be the cut divided by the least number
of edges incident on either set $C_i$ or $V \setminus C_i$:

\[ \text{cond}(C_i) = \frac{\text{cut}(C_i)}{\min\big(\text{links}(C_i, V),\ \text{links}(V \setminus C_i, V)\big)}. \]

By definition, $\text{cond}(C_i) = \text{cond}(V \setminus C_i)$. The conductance of a cluster is the probability of leaving that
cluster by a one-hop walk starting from the smaller set between $C_i$ and $V \setminus C_i$. Notice that $\text{cond}(C_i)$ is
always greater than or equal to $\text{ncut}(C_i)$.
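As a concrete illustration of the three measures, the following minimal Python sketch evaluates them on a toy graph. The dict-of-dicts adjacency format, the helper names, and the two-triangle example graph are assumptions made for this illustration; they are not taken from the implementation used in the experiments.

```python
def build_adj(edges):
    """Symmetric weighted adjacency as a dict of dicts (illustrative format)."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def links(adj, A, B):
    """Sum of A_ij over i in A and j in B."""
    B = set(B)
    return sum(w for i in A for j, w in adj[i].items() if j in B)

def cut(adj, C):
    return links(adj, C, set(adj) - set(C))

def ncut(adj, C):
    return cut(adj, C) / links(adj, C, set(adj))

def cond(adj, C):
    V = set(adj)
    return cut(adj, C) / min(links(adj, C, V), links(adj, V - set(C), V))

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3), unit weights.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
adj = build_adj(edges)
C = {0, 1, 2, 3}
print(cut(adj, C), ncut(adj, C), cond(adj, C))
```

On this example the set {0, 1, 2, 3} has a larger volume than its complement, so its conductance is normalized by the complement's volume and exceeds its normalized cut, in line with the remark that cond is never smaller than ncut.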


2.3 Graph Clustering and Weighted Kernel k-means

It has been shown that a graph clustering objective is mathematically equivalent to a weighted kernel
k-means objective m. For example, let us consider the normalized cut objective of a graph $G$ which
is defined to be

\[ \text{ncut}(G) = \min_{C_1, \dots, C_k} \sum_{i=1}^{k} \frac{\text{links}(C_i, V \setminus C_i)}{\text{links}(C_i, V)}. \tag{3} \]

This objective can be shown to be equivalent to a weighted kernel k-means objective by defining a
weight for each data point to be the degree of a vertex, and the kernel matrix to be
$K = \sigma D^{-1} + D^{-1} A D^{-1}$, where $D$ is the diagonal matrix of degrees (i.e., $D_{ii} = \sum_{j=1}^{n} A_{ij}$ where $n$ is the total number
of nodes), and $\sigma$ is a scalar typically chosen to make $K$ positive-definite. Then, we can quantify the
kernel distance between a vertex $v \in C_i$ and cluster $C_i$, denoted $\text{dist}(v, C_i)$, as follows:

\[ \text{dist}(v, C_i) = -\frac{2\,\text{links}(v, C_i)}{\deg(v)\deg(C_i)} + \frac{\text{links}(C_i, C_i)}{\deg(C_i)^2} + \frac{\sigma}{\deg(v)} - \frac{\sigma}{\deg(C_i)}, \tag{4} \]

where $\deg(v) = \text{links}(v, V)$, and $\deg(C_i) = \text{links}(C_i, V)$.
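For readers who want to see where (4) comes from, the following sketch expands the standard weighted kernel k-means squared distance between a point and its weighted cluster centroid, using vertex weights $w_u = \deg(u)$ and the kernel $K$ defined above. It assumes a loop-less graph ($A_{vv} = 0$) and $v \in C_i$; the feature map $\phi$ and the centroid $m_i$ are notation introduced only for this sketch.

```latex
\begin{align*}
\|\phi(v) - m_i\|^2
  &= K_{vv}
     - \frac{2 \sum_{u \in C_i} w_u K_{vu}}{\sum_{u \in C_i} w_u}
     + \frac{\sum_{u, u' \in C_i} w_u w_{u'} K_{uu'}}{\big(\sum_{u \in C_i} w_u\big)^2}, \\
K_{vv} &= \frac{\sigma}{\deg(v)}, \qquad
\sum_{u \in C_i} \deg(u)\, K_{vu} = \sigma + \frac{\text{links}(v, C_i)}{\deg(v)}, \\
\sum_{u, u' \in C_i} \deg(u) \deg(u')\, K_{uu'} &= \sigma \deg(C_i) + \text{links}(C_i, C_i).
\end{align*}
```

Substituting these quantities with $\sum_{u \in C_i} w_u = \deg(C_i)$, the terms $-2\sigma/\deg(C_i)$ and $+\sigma/\deg(C_i)$ combine into $-\sigma/\deg(C_i)$, which leaves exactly the four terms of (4).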


2.4 Datasets 

We use ten different real-world networks including collaboration networks, social networks, and a
product network from m, [28], and m. The networks are presented in Table 1. All the networks are
loop-less, connected, undirected graphs.

In a collaboration network, vertices indicate authors, and edges indicate co-authorship. If authors
u and v wrote a paper together, there exists an edge between them. For example, if a paper is written
by three authors, this is represented by a clique of size three in the network. HepPh, AstroPh, and
CondMat networks are constructed based on the papers submitted to the arXiv e-print service. Specifically,
HepPh represents the High Energy Physics (Phenomenology) category, AstroPh represents the
Astrophysics category, and CondMat represents the Condensed Matter Physics category. The DBLP
network is constructed based on the DBLP computer science bibliography website.

Table 1: Summary of Real-world Networks.

Category       Graph         No. of vertices  No. of edges  Max. Deg.  Avg. Deg.  Avg. CC  Ground-truth  Source
Collaboration  HepPh         11,204           117,619       491        21.0       0.6216   N/A           m
               AstroPh       17,903           196,972       504        22.0       0.6328   N/A           m
               CondMat       21,363           91,286        279        8.5        0.6417   N/A           m
               DBLP          317,080          1,049,866     343        6.6        0.6324   ✓             m
Product        Amazon        334,863          925,872       549        5.5        0.3967   ✓             m
Social         Orkut         731,332          21,992,171    6,933      60.1       0.2468   ✓             m
               Flickr        1,994,422        21,445,057    27,908     21.5       0.1881   N/A           m
               Myspace       2,086,141        45,459,079    92,821     43.6       0.1242   N/A           [28]
               LiveJournal   1,757,326        42,183,338    29,771     48.0       0.2400   N/A           m
               LiveJournal2  1,143,395        16,880,773    11,495     29.5       0.2535   ✓             m

Figure 1: Degree distributions of real-world networks - the degree distributions follow a power-law.

We use five different social networks: Flickr, Myspace, LiveJournal, LiveJournal2 (a variation 
with ground-truth), and Orkut. Flickr is an online photo sharing application, Myspace is a social 
entertainment networking service, LiveJournal is a blogging application where users can publish their 
own journals, and Orkut was a social networking website operated by Google. Users can form
friendship relationships with each other on each of these websites. So, in these social networks, nodes
represent users and edges represent friendship relationships between them.

In the Amazon product network, vertices represent products and edges represent co-purchasing
information. If products u and v are frequently co-purchased, there exists an undirected edge between
them. This network is constructed based on the “Customers Who Bought This Item Also Bought” feature
of the Amazon website.

In Table 1, we present the number of nodes/edges, the maximum degree, the average degree, and
the average clustering coefficient (CC) of each of the networks. Figure 1 shows the degree distributions
of the DBLP, Flickr, and Amazon networks. We can see that the real-world networks have distinguishing
characteristics: a power-law degree distribution [6] and a high clustering coefficient m, m.

As indicated in Table 1, we have ground-truth communities [T] on some of the datasets. In
DBLP, each publication venue (i.e., journal or conference) can be considered as an individual
ground-truth community. In the Amazon network, each ground-truth community can be defined to be a
product category that Amazon provides. In the LiveJournal2 and Orkut networks, there exist
user-defined social groups. On the LiveJournal2 and Orkut networks, the ground-truth communities do not
cover a substantial portion of the graph, so we use a subgraph which is induced by the nodes that
have at least one membership in the ground-truth communities. In Table 1, the statistics about
LiveJournal2 and Orkut are based on the induced subgraphs we used in our experiments.

3 Overlapping Community Detection Using Neighborhood-Inflated 
Seed Expansion 

We introduce our overlapping community detection algorithm, nise. It consists of four phases:
filtering, seeding, seed expansion, and propagation. In the filtering phase, we remove regions of the
graph that are trivially separable from the rest of the graph. In the seeding phase, we find good
seeds in the filtered graph, and in the seed expansion phase, we expand the seeds using a personalized
PageRank clustering scheme. Finally, in the propagation phase, we further expand the communities 
to the regions that were removed in the filtering phase. 

3.1 Filtering Phase 

The goal of the filtering phase is to identify regions of the graph where an algorithmic solution is 
required to identify the overlapping clusters. To explain our filtering step, recall that almost all graph 
partitioning methods begin by assigning each connected component to a separate partition. Any other 
choice of partitioning for disconnected components is entirely arbitrary. The Metis procedure m, 
for instance, may combine two disconnected components into a single partition in order to satisfy 
a balance constraint on the partitioning. For the problem of overlapping clustering, an analogous 
concept can be derived from biconnected components. Formally, a biconnected component is defined 
as follows: 

Definition 1. Given a graph $G = (V, E)$, a biconnected component is a maximal induced subgraph
$G' = (V', E')$ that remains connected after removing any vertex and its adjacent edges in $G'$.

Let us define the size of a biconnected component to be the number of edges in $G'$. Now, consider
all the biconnected components of size one. Notice that there should be no overlapping partitions
that use these edges because they bridge disjoint communities. Consequently, our filtering procedure
is to find the largest connected component of the graph after we remove all single-edge biconnected
components. We call this the “biconnected core” of the graph even though it may not be biconnected.
Let $E_S$ denote all the single-edge biconnected components. Then, the biconnected core graph is defined
as follows:

Definition 2. The biconnected core $G_C = (V_C, E_C)$ is the maximum size connected subgraph of
$G'' = (V, E \setminus E_S)$.

Notice that the biconnected core is not the 2-core of the original graph (a k-core graph is a maximal 
subgraph of the original graph in which all nodes have degree at least k [26]). Subgraphs connected 
to the biconnected core are called whiskers by Leskovec et al. m and we use the concept of a bridge 
to define them: 

Definition 3. A bridge is a biconnected component of size one which is directly connected to the 
biconnected core. 

Whiskers are then defined as follows: 

Definition 4. A whisker $W = (V_W, E_W)$ is a maximal subgraph of $G$ that can be detached from the
biconnected core by removing a bridge.

Figure 2: Biconnected core, whiskers, and bridges - grey region indicates the biconnected core where
vertices are densely connected to each other, and green components indicate whiskers. Red edges
indicate bridges which connect the biconnected core and each of the whiskers.

Table 2: Biconnected core and the detached graph (in the last column, LCC refers to the largest
connected component).

              Biconnected core                          Detached graph
Graph         No. of vertices (%)  No. of edges (%)     No. of components  Size of the LCC (%)
HepPh         9,945 (88.8%)        116,099 (98.7%)      1,123              21 (0.1874%)
AstroPh       16,829 (94.0%)       195,835 (99.4%)      957                23 (0.1285%)
CondMat       19,378 (90.7%)       89,128 (97.6%)       1,669              12 (0.0562%)
DBLP          264,341 (83.4%)      991,125 (94.4%)      43,093             32 (0.0101%)
Amazon        291,449 (87.0%)      862,836 (93.2%)      25,835             250 (0.0747%)
Flickr        954,672 (47.9%)      20,390,649 (95.1%)   864,628            107 (0.0054%)
Myspace       1,724,184 (82.7%)    45,096,696 (99.2%)   332,596            32 (0.0015%)
LiveJournal   1,650,851 (93.9%)    42,071,541 (99.7%)   101,038            105 (0.0060%)
LiveJournal2  1,076,499 (94.2%)    16,786,580 (99.4%)   59,877             91 (0.0080%)
Orkut         729,634 (99.8%)      21,990,221 (99.9%)   1,529              15 (0.0021%)


Let $E_B$ be all the bridges in a graph. Notice that $E_B \subseteq E_S$. On the region which is not included in
the biconnected core graph $G_C$, we define the detached graph $G_D$ as follows:

Definition 5. $G_D = (V_D, E_D)$ is the subgraph of $G$ which is induced by $V \setminus V_C$.

Finally, given the original graph $G = (V, E)$, $V$ and $E$ can be decomposed as follows:

Proposition 1. Given a graph $G = (V, E)$, $V = V_C \cup V_D$ and $E = E_C \cup E_D \cup E_B$.

Proof. This follows from the definitions of the biconnected core, bridges, and the detached graph. □

Figure 2 illustrates the biconnected core, whiskers, and bridges. The output of our filtering phase
is the biconnected core graph where whiskers are filtered out. The filtering phase removes regions of
the graph that are clearly partitionable from the remainder. Note that there is no overlap between
any of the whiskers. This indicates that there is no need to apply an overlapping community detection
algorithm on the detached regions.
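The filtering step can be sketched in a few lines of Python. The code below is an illustrative sketch, not the implementation used in the experiments: it assumes a simple unweighted graph stored as a dict of neighbor sets, finds the single-edge biconnected components with a standard iterative low-link depth-first search, and measures the "maximum size" connected subgraph by its number of vertices. The function names and the toy graph are hypothetical.

```python
def single_edge_components(adj):
    """Edges whose removal disconnects the graph (the set E_S), via low-link DFS."""
    index, low, found, counter = {}, {}, set(), 0
    for root in adj:
        if root in index:
            continue
        index[root] = low[root] = counter
        counter += 1
        stack = [(root, None, iter(adj[root]))]
        while stack:
            v, parent, it = stack[-1]
            advanced = False
            for u in it:
                if u == parent:          # simple graph assumed: skip the tree edge
                    continue
                if u not in index:
                    index[u] = low[u] = counter
                    counter += 1
                    stack.append((u, v, iter(adj[u])))
                    advanced = True
                    break
                low[v] = min(low[v], index[u])
            if not advanced:
                stack.pop()
                if parent is not None:
                    low[parent] = min(low[parent], low[v])
                    if low[v] > index[parent]:
                        found.add(frozenset((parent, v)))
    return found

def biconnected_core(adj):
    """Largest connected component (by vertex count) after removing E_S."""
    e_s = single_edge_components(adj)
    seen, best = set(), set()
    for s in adj:
        if s in seen:
            continue
        comp, todo = {s}, [s]
        while todo:
            v = todo.pop()
            for u in adj[v]:
                if u not in comp and frozenset((v, u)) not in e_s:
                    comp.add(u)
                    todo.append(u)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best, e_s

# Toy graph: a dense region {0,1,2,3} with whiskers {4,5} (via vertex 3) and {6} (via vertex 0).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0), (3, 4), (4, 5), (0, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
core, e_s = biconnected_core(adj)
print(sorted(core), sorted(tuple(sorted(e)) for e in e_s))
```

In this toy graph the edge (4, 5) is a single-edge biconnected component but not a bridge in the sense of Definition 3, since it lies inside a whisker rather than touching the core.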

Table 2 shows the size of the biconnected core and the connectivity of the detached graph in our
real-world networks. Details of these networks are presented in Table 1. We compute the size of the
biconnected core in terms of the number of vertices and edges. The number reported in the parentheses
shows how many vertices or edges are included in the biconnected core, i.e., the percentages of $|V_C|/|V|$
and $|E_C|/|E|$, respectively. We also compute the number of connected components in the detached
graph, and the size of the largest connected component (LCC in Table 2) in terms of the number of
vertices. The number reported in the parentheses indicates the relative size of the largest connected
component compared to the number of vertices in the original graph.

Algorithm 1 Seeding by Graclus Centers
Input: graph $G$, the number of seeds $k$.
Output: the seed set $S$.
1: Compute exhaustive and non-overlapping clusters $C_i$ ($i = 1, \dots, k$) on $G$.
2: Initialize $S = \emptyset$.
3: for each cluster $C_i$ do
4:   for each vertex $v \in C_i$ do
5:     Compute $\text{dist}(v, C_i)$ using (4).
6:   end for
7:   $S = \{\operatorname{argmin}_v \text{dist}(v, C_i)\} \cup S$.
8: end for

We can see that the biconnected core contains a substantial portion of the edges. In terms of
the vertices, the biconnected core contains between 83.4 and 99.8 percent of the vertices for all datasets
except Flickr. In Flickr, the biconnected core contains only 47.9 percent of the vertices while
it contains 95.1 percent of the edges. This indicates that the biconnected core is dense while the detached
graph is quite sparse. Recall that the biconnected core is one connected component. On the other
hand, in the detached graph, there are many connected components, which implies that the vertices
in the detached graph are likely to be disconnected from each other. Notice that each connected
component in the detached graph corresponds to a whisker. So, the largest connected component can
be interpreted as the largest whisker. Based on the statistics of the detached graph, we can see that
whiskers tend to be separable from each other, and there are no whiskers of significant size. Also, the
gap between the sizes of the biconnected core and the largest whisker is significant. All these statistics
and observations support that our filtering phase creates a reasonable and more tractable input for
an overlapping community detection algorithm.

3.2 Seeding Phase 

Once we obtain the biconnected core graph, we find seeds in this filtered graph. The goal of an 
effective seeding strategy is to identify a diversity of vertices, each of which lies within a cluster of 
good conductance. This identification should not be too computationally expensive. 

Graclus Centers. One way to achieve these goals is to first apply a high quality and fast graph
partitioning scheme (disjoint clustering of vertices in a graph) in order to compute a collection of sets
with fairly small conductance. Then, we select a set of seeds by picking the most central vertex from
each set (cluster). The idea here is roughly that we want something that is close to the partitioning,
which ought to be good, but that allows overlap to produce better boundaries between the partitions.

See Algorithm 1 for the full procedure. In practice, we perform top-down hierarchical clustering
using Graclus [10] to get a large number of clusters. Then, we take the center of each cluster as a seed.
The center of a cluster is defined to be the vertex that is closest to the cluster centroid (as discussed
in Section 2.3, we can quantify the distance between a vertex and a cluster centroid by using the
kernel that underlies the relationship between kernel k-means and graph clustering); see steps 5 and
7 in Algorithm 1. If there are several vertices whose distances are tied for the center of a cluster, we
include all of them.
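Steps 3 to 8 of Algorithm 1 can be sketched as follows, under the assumption that the disjoint clusters have already been computed by some partitioner; Graclus itself is not invoked here. The dict-of-dicts adjacency format, the choice sigma = 1.0, the tie tolerance `tol`, and the function names are all assumptions made for this illustration.

```python
def links(adj, A, B):
    """Sum of edge weights A_ij over i in A and j in B."""
    B = set(B)
    return sum(w for i in A for j, w in adj[i].items() if j in B)

def kernel_dist(adj, v, C, sigma=1.0):
    """Kernel distance of equation (4) between vertex v and cluster C."""
    deg_v = links(adj, [v], adj.keys())
    deg_C = links(adj, C, adj.keys())
    return (-2.0 * links(adj, [v], C) / (deg_v * deg_C)
            + links(adj, C, C) / deg_C ** 2
            + sigma / deg_v
            - sigma / deg_C)

def graclus_centers(adj, clusters, sigma=1.0, tol=1e-12):
    """Pick the vertex closest to each cluster centroid; keep all tied vertices."""
    seeds = set()
    for C in clusters:
        d = {v: kernel_dist(adj, v, C, sigma) for v in C}
        best = min(d.values())
        seeds |= {v for v in C if d[v] - best <= tol}
    return seeds

def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, {})[v] = 1
        adj.setdefault(v, {})[u] = 1
    return adj

# Two stars centered at 0 and 4, joined by the edge (0, 4).
stars = build_adj([(0, 1), (0, 2), (0, 3), (4, 5), (4, 6), (4, 7), (0, 4)])
star_seeds = graclus_centers(stars, [[0, 1, 2, 3], [4, 5, 6, 7]])

# A single triangle: all three vertices are equidistant from the centroid.
triangle = build_adj([(0, 1), (1, 2), (0, 2)])
tie_seeds = graclus_centers(triangle, [[0, 1, 2]])
print(star_seeds, tie_seeds)
```

The second example exercises the tie rule: every vertex of the triangle has the same distance to the centroid, so all three are returned as seeds.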

Spread Hubs. From another viewpoint, the goal is to select a set of well-distributed seeds in
the graph, such that they will have high coverage after we expand the sets. We greedily choose an
independent set of k points in the graph by looking at vertices in order of decreasing degree. For this
heuristic, we draw inspiration from the distance function (4), which shows that the distance between
a vertex and a cluster is inversely proportional to degree. Thus, high degree vertices are expected
to have small distances to many other vertices. This also explains why we call the method spread
hubs. It also follows from the recent results in m, [31], m which show that there should be good
clusters around high degree vertices in power-law graphs with high clustering coefficients. We use an
independent set in order to avoid picking seeds near each other.

Algorithm 2 Seeding by Spread Hubs
Input: graph $G = (V, E)$, the number of seeds $k$.
Output: the seed set $S$.
1: Initialize $S = \emptyset$.
2: All vertices in $V$ are unmarked.
3: while $|S| < k$ do
4:   Let $T$ be the set of unmarked vertices with max degree.
5:   for each $t \in T$ do
6:     if $t$ is unmarked then
7:       $S = \{t\} \cup S$.
8:       Mark $t$ and its neighbors.
9:     end if
10:  end for
11: end while

Our full procedure is described in Algorithm 2. In the beginning, all the vertices are unmarked.
Until k seeds are chosen, the following procedure is repeated: among unmarked vertices, the highest
degree vertex is selected as a seed, and then the selected vertex and its neighbors are marked. As the
algorithm proceeds exploring hubs in the network, if there are several vertices whose degrees are the
same, we take an independent set of those that are unmarked. This step may result in more than k
seeds; however, the final number of returned seeds typically does not exceed the input k too much
because there usually are not too many high degree vertices.
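A minimal Python sketch of Algorithm 2 is given below. It assumes an unweighted graph stored as a dict of neighbor sets and uses Python's built-in sort where a bucket sort would serve equally well; the function name and the toy graph are hypothetical. Because vertices are scanned in decreasing degree order, the first unmarked vertex always has the maximum degree among unmarked vertices, and the inner loop processes all vertices of that degree, which corresponds to the set T in step 4.

```python
def spread_hubs(adj, k):
    """Greedy independent set of high-degree vertices (may return more than k seeds)."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    seeds, marked = [], set()
    i, n = 0, len(order)
    while len(seeds) < k:
        while i < n and order[i] in marked:     # skip vertices already marked
            i += 1
        if i == n:                              # no unmarked vertex is left
            break
        d = len(adj[order[i]])                  # current maximum unmarked degree
        while i < n and len(adj[order[i]]) == d:
            t = order[i]
            i += 1
            if t not in marked:                 # step 6: t may have been marked meanwhile
                seeds.append(t)
                marked.add(t)
                marked.update(adj[t])
    return seeds

# Two hubs of degree 4 (vertices 0 and 5) that share the neighbor 4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (5, 6), (5, 7), (5, 8), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
hubs = spread_hubs(adj, 1)
print(hubs)
```

With k = 1 the sketch still returns both hubs, because they tie on degree and are not adjacent; this is the situation in which the number of seeds exceeds k.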

3.3 Seed Expansion Phase 

Once we have a set of seed vertices, we wish to expand the clusters around those seeds. An effective 
technique for this task is using a personalized PageRank (PPR) vector [22], also known as a
random-walk with restart [23]. A personalized PageRank vector is the stationary distribution of a random
walk that, with probability $\alpha$, follows a step of a random walk and, with probability $(1 - \alpha)$, jumps back
to a seed node. If there are multiple seed nodes, then the choice is usually uniformly random. Thus,
nodes close to the seed are more likely to be visited.

Recently, such techniques have been shown to produce communities that best match communities 
found in real-world networks [2]. In fact, personalized PageRank vectors have close relationships to 
graph cuts and clustering methods. Andersen et al. [4] show that a particular algorithm to compute a 
personalized PageRank vector, followed by a sweep over all cuts induced by the vector, will identify a 
set of good conductance within the graph. They prove this via a “localized Cheeger inequality” that 
states, informally, that the set identified via this procedure has a conductance that is not too far away 
from the best conductance of any set containing that vertex. Also, Mahoney et al. [20] show that 
personalized PageRank is, effectively, a seed-biased eigenvector of the Laplacian. They also show a 
limit to relate the personalized PageRank vectors to the Fiedler vector of a graph. 






Algorithm 3 Seed Expansion by PPR
Input: graph $G = (V, E)$, a seed node $s \in S$, PageRank link-following probability parameter $0 < \alpha < 1$,
accuracy $\varepsilon > 0$
Output: low conductance set $C$
1: Set $T = \{s\} \cup \{\text{neighbors of } s\}$
2: Initialize $x_v = 0$ for $v \in V$
3: Initialize $r_v = 0$ for $v \in V \setminus T$, $r_v = 1/|T|$ for $v \in T$
4: while any $r_v > \deg(v)\,\varepsilon$ do
5:   Update $x_v = x_v + (1 - \alpha) r_v$.
6:   For each $(v, u) \in E$, update $r_u = r_u + \alpha r_v / (2 \deg(v))$
7:   Update $r_v = \alpha r_v / 2$
8: end while
9: Sort vertices by decreasing $x_v / \deg(v)$
10: For each prefix set of vertices in the sorted list, compute the conductance of that set and set $C$ to
be the set that achieves the minimum.


We briefly summarize the PPR-based seed expansion procedure in Algorithm 3 (each seed is
expanded by this procedure). Please see Andersen et al. [1] for a full description of the algorithm.
The high level idea of this expansion method is that given a set of restart nodes (denoted by T in
Algorithm 3), we first compute the PPR vector, examine nodes in order of highest to lowest PPR
score, and then return the set that achieves the minimum conductance.

It is important to note that we can have multiple nodes in T (which corresponds to nonzero
elements in the personalization vector in PPR), and indeed we use the entire vertex neighborhood of
a seed node as the restart nodes (see step 1 in Algorithm 3). Since we do not just use a singleton
seed but also use its neighbors as the restart nodes in PPR, we call step 1 neighborhood inflation. We
empirically observed that this neighborhood inflation plays a critical role in producing low conductance
communities. See Section 5 for details. Recently, Gleich and Seshadhri [13] have provided some
theoretical justification for why neighborhood-inflated seeds may outperform a singleton seed in PPR 
expansion on many real-world networks. 

Steps 2-8 are closely related to a coordinate descent optimization procedure [7] on the PageRank
linear system. Although it may not be apparent from the procedure, this algorithm is remarkably
efficient when combined with appropriate data structures. The algorithm keeps two vectors of values
for each vertex, x and r. In a large graph, most of these values will remain zero on the vertices
and hence, these need not be stored. Our implementation uses a hash table for the vectors x and r.
Consequently, the sorting step is only over a small fraction of the total vertices. Typically, we find
this method takes only a few milliseconds, even for a large graph.
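The procedure can be sketched in Python with dictionaries playing the role of the hash tables for x and r. This is an illustrative sketch rather than the implementation used in the experiments: it assumes an unweighted, loop-less graph stored as a dict of neighbor sets, uses a work queue (an assumption about scheduling, which Algorithm 3 leaves open) to find vertices whose residual exceeds the threshold, and picks eps = 1e-4 only for the toy example. The `normalize` flag, also hypothetical, switches between sorting by the degree-normalized score of step 9 and by the raw score.

```python
from collections import deque

def ppr_push(adj, seed, alpha=0.99, eps=1e-4):
    """Approximate PPR vector for the neighborhood-inflated seed (steps 1-8)."""
    T = {seed} | set(adj[seed])                  # step 1: neighborhood inflation
    x = {}                                       # sparse solution vector
    r = {v: 1.0 / len(T) for v in T}             # sparse residual vector
    queue, queued = deque(T), set(T)
    while queue:
        v = queue.popleft()
        queued.discard(v)
        rv, dv = r.get(v, 0.0), len(adj[v])
        if rv <= dv * eps:
            continue
        x[v] = x.get(v, 0.0) + (1 - alpha) * rv  # step 5
        r[v] = alpha * rv / 2                    # step 7
        for u in adj[v]:                         # step 6
            r[u] = r.get(u, 0.0) + alpha * rv / (2 * dv)
            if u not in queued and r[u] > len(adj[u]) * eps:
                queue.append(u)
                queued.add(u)
        if r[v] > dv * eps:
            queue.append(v)
            queued.add(v)
    return x

def sweep_cut(adj, x, normalize=True):
    """Steps 9-10: return the prefix set with minimum conductance."""
    score = (lambda v: x[v] / len(adj[v])) if normalize else (lambda v: x[v])
    order = sorted(x, key=score, reverse=True)
    total = sum(len(adj[v]) for v in adj)
    inside, vol, cut = set(), 0, 0
    best_size, best_cond = None, float("inf")
    for i, v in enumerate(order):
        internal = sum(1 for u in adj[v] if u in inside)
        inside.add(v)
        vol += len(adj[v])
        cut += len(adj[v]) - 2 * internal        # edges to `inside` leave the cut
        denom = min(vol, total - vol)
        if denom > 0 and cut / denom < best_cond:
            best_size, best_cond = i + 1, cut / denom
    return set(order[:best_size]), best_cond

# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
community, conductance = sweep_cut(adj, ppr_push(adj, seed=0))
print(sorted(community), conductance)
```

Only vertices that receive mass ever appear in `x` and `r`, so the sort in the sweep touches just the visited part of the graph, which is the source of the efficiency described above.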

In the original PPR clustering scheme [4], the PPR score is divided by the degree of each node (step
9) to remove bias towards high degree nodes. This step converts a PageRank vector, a left eigenvector
of a Markov chain, into the right eigenvector of a Markov chain. Right eigenvectors are close relatives
of the Fiedler vector of a graph, and so this degree normalization produces a vector that we call the
Fiedler Personalized PageRank vector because of this relationship. Fiedler vectors also satisfy Cheeger
inequalities, just like the Fiedler Personalized PageRank vectors. However, Kloumann and Kleinberg
[16] recently reported that this degree normalization might slightly degrade the quality of the output
clusters in terms of matching with ground-truth communities in some real-world networks. So, in our
experiments, we also try using the PPR score which we just call PPR. We compare the performance
of the Fiedler PPR and PPR in Section 5.

Algorithm 4 Propagation Procedure
Input: graph $G = (V, E)$, biconnected core $G_C = (V_C, E_C)$, communities of $G_C$: $C_i$ ($i = 1, \dots, k$) $\in C$.
Output: communities of $G$.
1: for each $C_i$ do
2:   Detect bridges $E_{B_i}$ attached to $C_i$.
3:   for each $b_j \in E_{B_i}$ do
4:     Detect the whisker $w_j = (V_j, E_j)$ which is attached to $b_j$.
5:     $C_i = C_i \cup V_j$.
6:   end for
7: end for


In Algorithm 3, there are two parameters which are related to PPR computation: $\alpha$ and $\varepsilon$. We
follow standard practice for PPR clustering on an undirected graph and set $\alpha = 0.99$ [19]. This value
yields results that are similar to those without damping, yet have bounded computational time. The
parameter $\varepsilon$ is an accuracy parameter. As $\varepsilon \to 0$, the final vector solution $x$ tends to the exact
solution of the PageRank linear system. When used for clustering, however, this parameter controls
the effective size of the final cluster. If $\varepsilon$ is large (about $10^{-2}$), then the output vector is inaccurate,
incredibly sparse, and the resulting cluster is small. If $\varepsilon$ is small, say $10^{-8}$, then the PageRank vector is
accurate, nearly dense, and the resulting cluster may be large. We thus run the PPR clustering scheme
several times, with a range of accuracy parameters that are empirically designed to produce clusters
with between 1 and 50,000 times the number of edges in the initial seed set. The final community we
select is the one with the best conductance score from these possibilities.

3.4 Propagation Phase 

Once we get the personalized PageRank communities on the biconnected core, we further expand
each of the communities to the regions detached in the filtering phase. Our assignment procedure is
straightforward: for each detached whisker connected via a bridge, we add that piece to all of the
clusters that utilize the other vertex in the bridge. This procedure is described in Algorithm 4. In
this way, each community $C_i$ is expanded.
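The propagation step can be sketched as follows. The sketch assumes that the graph is connected and stored as a dict of neighbor sets, that the vertex set of the biconnected core is already known, and that each whisker hangs off exactly one core vertex through a single bridge, as the definitions above imply. The function name and the toy data are hypothetical.

```python
def propagate(adj, core, communities):
    """Add every whisker to all communities containing its attachment vertex."""
    core = set(core)
    attached = {}                 # core vertex -> vertices of the whiskers hanging off it
    assigned = set()
    for c in core:
        for u in adj[c]:
            if u in core or u in assigned:
                continue
            whisker, todo = {u}, [u]          # collect the whisker behind bridge (c, u)
            while todo:
                v = todo.pop()
                for w in adj[v]:
                    if w not in core and w not in whisker:
                        whisker.add(w)
                        todo.append(w)
            assigned |= whisker
            attached.setdefault(c, set()).update(whisker)
    expanded = []
    for C in communities:
        C_new = set(C)
        for v in C:
            C_new |= attached.get(v, set())
        expanded.append(C_new)
    return expanded

# Core {0,1,2,3}; whisker {4,5} attached at vertex 3 and whisker {6} attached at vertex 0.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0), (3, 4), (4, 5), (0, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
result = propagate(adj, {0, 1, 2, 3}, [{0, 1, 2}, {2, 3}])
overlap = propagate(adj, {0, 1, 2, 3}, [{0, 1}, {0, 3}])
print(result, overlap)
```

The second call shows that a whisker is copied into every community that contains its attachment vertex, so overlapping communities stay overlapping after propagation.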

We now show that our propagation procedure only improves the quality of the final clustering
result in terms of the normalized cut metric. To do this, we need to fix some notation. Let $E_{B_i}$ be a
set of bridges which are attached to $C_i$, and $W_{C_i}$ be a set of whiskers which are attached to the bridges,
i.e., $W_{C_i} = (V_{W_i}, E_{W_i})$ where

\[ w_j = (V_j, E_j) \in W_{C_i}; \quad V_{W_i} = \bigcup_{w_j \in W_{C_i}} V_j. \]

Finally, let $C_i'$ denote the expanded $C_i$, where $|C_i'| \ge |C_i|$. Equality holds in this expression when there
is no bridge attached to $C_i$. When we expand $C_i$ using Algorithm 4, $C_i'$ is equal to $C_i \cup V_{W_i}$. The
following results show that we only decrease the size of the (normalized) cut by adding the whiskers.

Theorem 1. If a community Ci is expanded toC[ using Algorithm^ cut{C[) = cut{Ci)—links{VwijCi). 


Proof. Recall that $\mathrm{cut}(C_i)$ is defined as follows:

$$\mathrm{cut}(C_i) = \mathrm{links}(C_i, V \setminus C_i) = \mathrm{links}(C_i, V) - \mathrm{links}(C_i, C_i).$$

Let us first consider $\mathrm{links}(C_i', V)$ as follows:

$$\mathrm{links}(C_i', V) = \mathrm{links}(C_i, V) + \mathrm{links}(V_{W_i}, V) - \mathrm{links}(V_{W_i}, C_i).$$

Notice that $\mathrm{links}(V_{W_i}, V) = \mathrm{links}(V_{W_i}, V_{W_i}) + \mathrm{links}(V_{W_i}, C_i)$ by definition of whiskers. Thus,
$\mathrm{links}(C_i', V)$ can be expressed as follows:

$$\mathrm{links}(C_i', V) = \mathrm{links}(C_i, V) + \mathrm{links}(V_{W_i}, V_{W_i}). \qquad (5)$$

On the other hand, $\mathrm{links}(C_i', C_i')$ can be expressed as follows:

$$\mathrm{links}(C_i', C_i') = \mathrm{links}(V_{W_i}, V_{W_i}) + \mathrm{links}(C_i, C_i) + \mathrm{links}(V_{W_i}, C_i). \qquad (6)$$

Now, let us compute $\mathrm{cut}(C_i')$, which is defined by

$$\mathrm{cut}(C_i') = \mathrm{links}(C_i', V) - \mathrm{links}(C_i', C_i'). \qquad (7)$$

By rewriting (7) using (5) and (6), we can express $\mathrm{cut}(C_i')$ as follows:

$$\mathrm{cut}(C_i') = \mathrm{cut}(C_i) - \mathrm{links}(V_{W_i}, C_i).$$

□


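Theorem 2. If a community $C_i$ is expanded to $C_i'$ using the propagation algorithm, then $\mathrm{ncut}(C_i') \leq \mathrm{ncut}(C_i)$.

Proof. Recall that

$$\mathrm{ncut}(C_i) = \frac{\mathrm{cut}(C_i)}{\mathrm{links}(C_i, V)}.$$

On the other hand, by Theorem 1, we can represent $\mathrm{ncut}(C_i')$ as follows:

$$\mathrm{ncut}(C_i') = \frac{\mathrm{cut}(C_i')}{\mathrm{links}(C_i', V)} = \frac{\mathrm{cut}(C_i) - \mathrm{links}(V_{W_i}, C_i)}{\mathrm{links}(C_i, V) + \mathrm{links}(V_{W_i}, V_{W_i})}.$$

The numerator does not increase and the denominator does not decrease. Therefore,
$\mathrm{ncut}(C_i') \leq \mathrm{ncut}(C_i)$. Equality holds when there is no bridge attached to $C_i$, i.e., $E_{B_i} = \emptyset$. □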
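Both theorems can be checked numerically on a small example. The sketch below assumes that links(A, B) counts each undirected edge with one endpoint in A and the other in B exactly once, which is the reading under which equations (5) and (6) hold as written; the toy graph and the helper names are hypothetical.

```python
def links(edges, a, b):
    """Number of undirected edges with one endpoint in a and the other in b."""
    return sum(1 for u, v in edges if (u in a and v in b) or (u in b and v in a))


def cut(edges, c, vertices):
    """Edges leaving the vertex set c."""
    return links(edges, c, vertices - c)


def ncut(edges, c, vertices):
    """Normalized cut: cut(c) divided by links(c, V)."""
    return cut(edges, c, vertices) / links(edges, c, vertices)


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0),  # biconnected core: a 5-cycle
         (1, 5), (5, 6), (5, 7),                  # bridge (1, 5) and whisker {5, 6, 7}
         (3, 8)]                                  # whisker attached outside the community
vertices = set(range(9))
community = {0, 1, 2}
whisker = {5, 6, 7}
expanded = community | whisker

# Equation (5): links(C', V) = links(C, V) + links(V_W, V_W).
assert links(edges, expanded, vertices) == (
    links(edges, community, vertices) + links(edges, whisker, whisker))
# Theorem 1: cut(C') = cut(C) - links(V_W, C).
assert cut(edges, expanded, vertices) == (
    cut(edges, community, vertices) - links(edges, whisker, community))
# Theorem 2: ncut(C') <= ncut(C).
assert ncut(edges, expanded, vertices) <= ncut(edges, community, vertices)
print(ncut(edges, community, vertices), ncut(edges, expanded, vertices))
```

Here the single bridge (1, 5) is a cut edge of the community before propagation and an internal edge afterwards, so the cut drops by one while the number of incident edges grows by the two whisker edges.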


3.5 Time Complexity Analysis 

We summarize the time complexity of our overall algorithm in Table 3. The filtering phase requires
computing biconnected components in a graph, which takes $O(|V| + |E|)$ time. The complexity of the
“Graclus centers” seeding strategy is determined by the complexity of hierarchical clustering using
Graclus. Recall that the “Spread hubs” seeding strategy requires nodes to be sorted according to their
degrees. Thus, the complexity of this strategy is bounded by the sorting operation (we can use a bucket
sort). Expanding each seed requires solving multiple personalized PageRank clustering problems. The
complexity of this operation is complicated to state compactly [3], but it scales with the output size
of each cluster, $\mathrm{links}(C_i, V_C)$. Finally, our simple propagation procedure scans the regions that were
not included in the biconnected core and attaches them to the final communities.
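The bucket sort mentioned for the “Spread hubs” strategy can be sketched as follows. Only the sorting step is shown, under the assumption that the graph is stored as an adjacency dictionary; the function name is hypothetical. Since every degree is an integer between 0 and the maximum degree, the sort takes time linear in the number of vertices plus the maximum degree.

```python
def sort_by_degree_desc(adj):
    """Order vertices by decreasing degree with a bucket sort.

    adj: dict mapping each vertex to a list of neighbours.
    """
    max_deg = max((len(nbrs) for nbrs in adj.values()), default=0)
    buckets = [[] for _ in range(max_deg + 1)]
    for v, nbrs in adj.items():
        buckets[len(nbrs)].append(v)  # one bucket per possible degree
    return [v for d in range(max_deg, -1, -1) for v in buckets[d]]


adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
print(sort_by_degree_desc(adj))
```

Vertices with equal degree keep their insertion order within a bucket in this sketch.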
Table 3: Time complexity of each phase.

Phase                        Time complexity
Filtering                    $O(|V| + |E|)$
Seeding (Graclus centers)    $O(\lceil \log k \rceil (|V_C| + |E_C|))$
Seeding (Spread hubs)        $O(|V_C|)$
Seed expansion               $O(\sum_i \mathrm{links}(C_i, V_C))$
Propagation                  $O(\ldots + |V_{W_i}| + |E_{W_i}|)$


4 Related Work 

For overlapping community detection, many different approaches have been proposed [32] including
clique percolation, line graph partitioning, eigenvector methods, ego network analysis, and low-rank
models. Clique percolation methods look for overlap between fixed size cliques in the graph [23]. Line
graph partitioning is also known as link communities. Given a graph $G = (V, E)$, the line graph
$L(G)$ (also called the dual graph) has a vertex for each edge in $G$ and an edge whenever two edges (in
$G$) share a vertex. For instance, the line graph of a star is a clique. A partitioning of the line graph
induces an overlapping clustering in the original graph [3]. Even though these clique percolation and
line graph partitioning methods are known to be useful for finding meaningful overlapping structures,
these methods often fail to scale to large networks like those we consider.
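The line graph construction, and the way a partition of its vertices induces overlapping vertex clusters, can be illustrated with a short sketch. The function names and the example graphs are hypothetical, and the code assumes a simple undirected graph given as a list of edges.

```python
from itertools import combinations


def line_graph(edges):
    """Edges of L(G): pairs of edges of G that share an endpoint."""
    nodes = [frozenset(e) for e in edges]
    return {frozenset((e, f)) for e, f in combinations(nodes, 2) if e & f}


def vertex_clusters(edge_partition):
    """Map a partition of the edges of G to overlapping vertex clusters."""
    return [set().union(*(set(e) for e in part)) for part in edge_partition]


# A star with four leaves: its line graph has 4 vertices and all 6 possible edges.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(len(line_graph(star)))

# Partitioning the edges of the path 0-1-2-3-4 into two halves puts vertex 2 in both clusters.
print(vertex_clusters([[(0, 1), (1, 2)], [(2, 3), (3, 4)]]))
```

The quadratic pairing in `line_graph` also hints at the scaling issue: a vertex of degree d contributes on the order of d squared edges to the line graph.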

Eigenvector methods generalize spectral methods and use a soft clustering scheme applied to
eigenvectors of the normalized Laplacian or modularity matrix in order to estimate communities [33].
Ego network analysis methods use the theory of structural holes [8], and compute and combine many
communities through manipulating ego networks [25]. We compare against the Demon method
[9] that uses this strategy. We also note that other low-rank methods such as non-negative matrix
factorizations identify overlapping communities as well. We compare against the Bigclam method [33]
that uses this approach.

The approach we employ is called local optimization and expansion [32]. Starting from a seed,
such a method greedily expands a community around that seed until it reaches a local optimum of the
community detection objective. Determining how to seed a local expansion method is, arguably, a
critical problem within these methods. Strategies to do so include using maximal cliques, prior
information, or locally minimal neighborhoods [13]. The latter method was shown to identify the
vast majority of good conductance sets in a graph; however, there was no provision made for total
coverage of all vertices.

Different optimization objectives and expansion methods can be used in a local expansion method.
For example, Oslom tests the statistical significance of clusters with respect to a random
configuration during community expansion. Starting from a randomly picked node, the Oslom method
greedily expands the cluster by checking whether the expanded community is statistically significant
or not, which results in detecting a set of overlapping clusters and outliers in a graph. We compare
our method with the Oslom method in our experiments (see Section 5).

In our algorithm, we use a personalized PageRank based cut finder [4] for the local expansion
method. Abrahao et al. [2] observe that the structure of real-world communities can be well captured
by random-walk-based algorithms, i.e., personalized PageRank clusters are topologically similar
to real-world clusters. More recently, Kloumann and Kleinberg [16] propose to use pure PageRank
scores instead of the Fiedler PageRank scores to get a higher accuracy in terms of matching with
ground-truth communities.

A preliminary version of this work has appeared previously. In this paper, we provide technical
details about neighborhood inflation in our seed expansion phase, and include additional experimental
results to show the importance of the neighborhood inflation step. Also, we test and compare the
performance of the Fiedler PageRank and the standard PPR in our expansion phase. We also improve
the implementation of our algorithm by expanding seeds in parallel using multiple threads.

Table 4: Returned number of clusters and graph coverage of each algorithm.

Graph          Metric            oslom     demon     bigclam   nise-sph-fppr   nise-grc-fppr
HepPh          coverage (%)      100       88.83     84.37     100             100
               no. of clusters   608       5,147     100       99              90
AstroPh        coverage (%)      100       94.15     91.11     100             100
               no. of clusters   1,241     8,259     200       212             246
CondMat        coverage (%)      100       91.16     99.96     100             100
               no. of clusters   1,534     10,474    200       201             249
Flickr         coverage (%)      N/A       N/A       52.13     93.60           100
               no. of clusters   N/A       N/A       15,000    15,349          16,347
LiveJournal    coverage (%)      N/A       N/A       43.86     99.78           99.79
               no. of clusters   N/A       N/A       15,000    15,058          16,271
Myspace        coverage (%)      N/A       N/A       N/A       99.87           100
               no. of clusters   N/A       N/A       N/A       15,324          16,366
DBLP           coverage (%)      100       84.89     100       100             100
               no. of clusters   17,519    174,560   25,000    26,503          18,477
Amazon         coverage (%)      100       79.16     100       100             100
               no. of clusters   17,082    105,685   25,000    27,763          20,036
Orkut          coverage (%)      N/A       N/A       82.13     99.99           100
               no. of clusters   N/A       N/A       25,000    25,204          32,622
LiveJournal2   coverage (%)      N/A       N/A       56.64     99.95           99.99
               no. of clusters   N/A       N/A       25,000    25,065          32,274

5 Experimental Results 

We compare our algorithm, NISE, with other state-of-the-art overlapping community detection
methods: Bigclam [33], Demon [9], and Oslom. For these three methods, we used the software
provided by their respective authors. While Demon and Oslom only
support sequential execution, Bigclam supports multi-threaded execution. NISE is written in
a mixture of C++ and MATLAB. In NISE, seeds can be expanded in parallel, and this feature is
implemented using the Parallel Computing Toolbox provided by MATLAB. We compare the performance
of each of these methods on ten different real-world networks, which are presented in Section 2.4.

Within NISE, we also compare the performance of different seeding strategies and some variants
of expansion methods. We use four different seeding strategies: “graclus centers” (denoted by
“nise-grc-*”) and “spread hubs” (denoted by “nise-sph-*”), which are proposed in this manuscript, “locally
minimal neighborhoods” (denoted by “nise-lcm-*”), which has been proposed in [13], and a random
seeding strategy (denoted by “nise-rnd-*”) where we randomly take $k$ seeds. Andersen and Lang [5]
have provided some theoretical justification for why random seeding also should be competitive. On
the other hand, we also compare two different expansion methods: the Fiedler Personalized PageRank
(denoted by “nise-*-fppr”), and the standard Personalized PageRank (denoted by “nise-*-ppr”).
Figure 3: Distributions of cluster sizes from the methods. These plots show a kernel density smoothed
histogram of the cluster sizes from each method. The horizontal axis is the cluster size and the vertical
axis is proportional to the number of clusters of that size.


5.1 Graph Coverage and Community Sizes 

We first report the returned number of clusters and the graph coverage of each algorithm in Table 4.
The graph coverage indicates how many vertices are assigned to clusters (i.e., the number of assigned
vertices divided by the total number of vertices in a graph). Note that we can control the number of
seeds $k$ in NISE and the number of clusters $k$ in Bigclam. We set $k$ (in our methods and Bigclam) as 100
for HepPh, 200 for AstroPh and CondMat, 15,000 for Flickr, Myspace, and LiveJournal, and 25,000 for
DBLP, Amazon, LiveJournal2, and Orkut networks without any tuning and using the guidance that
larger graphs can have more clusters. For the networks where we have ground-truth communities, we
slightly overestimate the number of clusters $k$ since there usually exists a large number of ground-truth
communities. Since we remove duplicate clusters after the PageRank expansion in NISE, the returned
number of clusters can be smaller than $k$. Also, since we choose all the tied seeds in “graclus centers”
and “spread hubs”, the returned number of clusters of these algorithms can be slightly larger than $k$.
Recall that we use a top-down hierarchical clustering scheme in the “graclus centers” strategy. So, in
this case, the returned number of clusters before filtering the duplicate clusters is slightly greater than
or equal to $2^{\lceil \log k \rceil}$. On the other hand, Demon and Oslom determine the number of clusters based on
datasets. Demon and Oslom fail on Flickr, Myspace, LiveJournal, LiveJournal2, and Orkut. Bigclam
does not finish on the Myspace network (using 4 threads) after running for one week.
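The quantities reported in this subsection are simple to compute. The following sketch shows duplicate removal, graph coverage as a percentage, and the power-of-two count produced by top-down bisection; the function names are hypothetical, and the logarithm is assumed to be base 2 because the hierarchy is built by repeated bisection.

```python
def dedupe_clusters(clusters):
    """Drop clusters that contain exactly the same vertices as an earlier one."""
    seen, unique = set(), []
    for cluster in clusters:
        key = frozenset(cluster)
        if key not in seen:
            seen.add(key)
            unique.append(set(key))
    return unique


def graph_coverage(clusters, num_vertices):
    """Percentage of vertices that belong to at least one cluster."""
    covered = set().union(*clusters) if clusters else set()
    return 100.0 * len(covered) / num_vertices


def bisection_cluster_count(k):
    """Smallest power of two that is at least k, i.e. 2 ** ceil(log2(k)), for k >= 1."""
    return 2 ** (k - 1).bit_length()


clusters = dedupe_clusters([{0, 1}, {1, 0}, {2, 3}])
print(len(clusters), graph_coverage(clusters, 8))
print(bisection_cluster_count(100), bisection_cluster_count(15000), bisection_cluster_count(25000))
```

For $k = 15{,}000$ and $k = 25{,}000$ the bisection counts are 16,384 and 32,768, which is in line with the “nise-grc-fppr” cluster counts in Table 4 once duplicates are removed.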

Figure 3 shows distributions of cluster sizes. These figures show that the NISE method tends to find
larger clusters than the other methods, usually about 10 to 100 times as large. Also, the NISE method
often finds a number of large clusters; these are the spikes on the right for subfigures (f)-(j). This
tends to happen slightly more often for the “graclus centers” seeding strategy. The other observation
is that NISE tends to produce more variance in the sizes of the clusters than the other methods and
the resulting histograms are not as sharply peaked.
Figure 4: Importance of neighborhood inflation. There is a large performance gap between singleton
seeds and neighborhood-inflated seeds for all the seeding strategies. Neighborhood inflation plays a
critical role in the success of NISE. When neighborhood-inflated seeds are used, “graclus centers” and
“spread hubs” seeding strategies significantly outperform other seeding strategies.


5.2 Importance of Neighborhood-Inflation 

We evaluate the quality of overlapping clustering in terms of the maximum conductance of any cluster. 
A high quality algorithm should return a set of clusters that covers a large portion of the graph with 
small maximum conductance. This metric can be presented by a conductance-vs-coverage curve. That 
is, for each method, we first sort the clusters according to the conductance scores in ascending order,
and then greedily take clusters until a certain percentage of the graph is covered. The x-axis of each 
plot is the graph coverage, and the y-axis is the maximum conductance value among the clusters we 
take. We can interpret this plot as follows: we need to use clusters whose conductance scores are less 
than or equal to y to cover x percentage of the graph. Note that lower conductance indicates better 
quality of clusters, i.e., a lower curve indicates better clusters. 
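The construction of the conductance-vs-coverage curve can be sketched as follows. The input format, a list of (vertex set, conductance) pairs, and the function name are assumptions made for illustration. Because clusters are taken in ascending order of conductance, the conductance of the most recently taken cluster is also the maximum over all clusters taken so far.

```python
def conductance_vs_coverage(clusters, num_vertices):
    """Return the curve as a list of (coverage fraction, max conductance) points.

    clusters: iterable of (vertex_set, conductance) pairs.
    """
    covered = set()
    curve = []
    for vertex_set, phi in sorted(clusters, key=lambda item: item[1]):
        covered |= vertex_set
        curve.append((len(covered) / num_vertices, phi))
    return curve


clusters = [({0, 1, 2}, 0.2), ({2, 3}, 0.1), ({4, 5, 6, 7}, 0.5)]
for coverage, max_phi in conductance_vs_coverage(clusters, 10):
    print(coverage, max_phi)
```

In this toy example, covering 40% of the graph requires clusters with conductance up to 0.2, while covering 80% requires conductance up to 0.5.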

First, we verify the importance of neighborhood inflation in our seed expansion phase. Recall that
when we compute the personalized PageRank (PPR) score for each seed node, we use the seed node’s
entire vertex neighborhood (the vertex neighborhood is also referred to as the “ego network”) as the
restart region in PPR (details are in Section 3.3). To see how this affects the overall performance of
the seed expansion method, we compare the performance of singleton seeds and neighborhood-inflated
seeds. Figure 4 shows the conductance-vs-coverage plot for singleton seeds and neighborhood-inflated
seeds. “*-single” indicates singleton seeds, i.e., each seed is solely used as the restart region in PPR.
“*-ego” indicates neighborhood-inflated seeds. We also used four different seeding strategies: “graclus
centers” (denoted by “grc-*”), “spread hubs” (denoted by “sph-*”), “locally minimal neighborhoods”
(denoted by “lcm-*”), and “random” (denoted by “rnd-*”).

We can see that the performance significantly degrades when singleton seeds are used for all
the seeding strategies. This implies that neighborhood inflation plays a critical role in the success
of our method. Even though we only present the results on LiveJournal, Myspace, and Flickr in
Figure 4 for brevity, we consistently observed that neighborhood-inflated seeds are much better than
singleton seeds on all other networks. We also notice that when neighborhood-inflated seeds
are used, both “graclus centers” and “spread hubs” seeding strategies significantly outperform other
seeding strategies. The “spread hubs” and “graclus centers” seeding strategies produce similar results on
LiveJournal, whereas “graclus centers” is better than “spread hubs” on Myspace and Flickr. We used
the conventional Fiedler PPR for the expansion phase in Figure 4, but we reached the same conclusion
using the standard PPR.
Figure 5: AUC of conductance-vs-coverage. A lower bar indicates better communities. NISE
outperforms Demon, Oslom, and Bigclam. Within NISE, “graclus centers” and “spread hubs” seeding
strategies are better than other seeding strategies, and the Fiedler PPR produces slightly better
communities than the standard PPR.

5.3 Community Quality Using Conductance 

We compute the AUC (Area Under the Curve) of the conductance-vs-coverage curve to compare the performance
of NISE with other state-of-the-art methods. Within NISE, we also compare four different seeding
strategies and two different expansion methods. The AUC scores are normalized such that they are
between zero and one.

Figure 5 shows AUC scores on the six networks where we do not have ground-truth community
information (see Section 2.4 for details about these networks). We can see several patterns in Figure 5.
First, within NISE, “graclus centers” and “spread hubs” seeding strategies outperform the other two
seeding strategies. Second, for most of the cases, “fppr” leads to slightly better communities than
“ppr”. Also, we can see that “nise-grc-fppr” shows the best performance for all networks. Third, NISE
outperforms Demon, Oslom, and Bigclam. There is a significant performance gap between NISE and
these methods.


5.4 Community Quality via Ground-truth 

We have ground-truth communities for the DBLP, Amazon, LiveJournal2, and Orkut networks; thus,
for these networks, we compare against the ground-truth communities. Given a set of algorithmic
communities $\mathcal{C}$ and the ground-truth communities $\mathcal{S}$, we compute the $F_1$ measure and the $F_2$ measure to
evaluate the relevance between the algorithmic communities and the ground-truth communities. In
general, the $F_\beta$ measure is defined as follows:

$$F_\beta(S_i) = (1 + \beta^2) \frac{\mathrm{precision}(S_i) \cdot \mathrm{recall}(S_i)}{\beta^2 \cdot \mathrm{precision}(S_i) + \mathrm{recall}(S_i)}$$

where $\beta$ is a non-negative real value, and the precision and recall of $S_i \in \mathcal{S}$ are defined as follows:

$$\mathrm{precision}(S_i) = \frac{|C_j \cap S_i|}{|C_j|}, \qquad \mathrm{recall}(S_i) = \frac{|C_j \cap S_i|}{|S_i|},$$

where $C_j \in \mathcal{C}$, and $F_\beta(S_i) = F_\beta(S_i, C_{j^*})$ where $j^* = \arg\max_j F_\beta(S_i, C_j)$. Then, the average $F_\beta$ measure
is defined to be

$$\bar{F}_\beta = \frac{1}{|\mathcal{S}|} \sum_{S_i \in \mathcal{S}} F_\beta(S_i).$$

Given an algorithmic community, precision indicates how many vertices are actually in the same
ground-truth community. Given a ground-truth community, recall indicates how many vertices are
predicted to be in the same community in a retrieved community. By definition, the precision and
the recall are evenly weighted in the $F_1$ measure. On the other hand, the $F_2$ measure puts more emphasis
on recall than precision. The authors in [33] who provided the datasets argue that it is important
to quantify the recall since the ground-truth communities in these datasets are partially annotated,
i.e., some vertices are not annotated to be a part of the ground-truth community even though they
actually belong to that community. This indicates that it would be reasonable to weight recall higher
than precision, which is done by the $F_2$ measure.

In Table 5, we report the average $F_1$ and $F_2$ measures on DBLP, Amazon, LiveJournal2, and
Orkut networks. A higher value indicates better communities. We see that NISE outperforms Bigclam,
Demon, and Oslom in terms of both $F_1$ and $F_2$ measures on these networks. Within NISE, “spread
hubs” seeding is better than “graclus centers” seeding, and the standard PPR is slightly better than
the Fiedler PPR in most of the cases. So, we see that the standard PPR is useful for identifying
ground-truth communities. This result is also consistent with the recent observations in [16].

Table 5: $F_1$ and $F_2$ measures (%).

                 DBLP           Amazon         LiveJournal2   Orkut
Method           F1     F2      F1     F2      F1     F2      F1     F2
bigclam          15.1   13.0    27.1   25.6    11.3   13.7    43.0   47.4
demon            13.7   12.0    16.5   15.3    N/A    N/A     N/A    N/A
oslom            13.4   11.6    32.0   30.2    N/A    N/A     N/A    N/A
nise-lcm-fppr    13.9   15.4    46.3   56.5    11.3   13.8    40.9   46.8
nise-rnd-fppr    17.7   20.5    48.9   58.8    12.1   16.5    54.6   62.9
nise-sph-fppr    18.1   21.4    49.2   59.5    12.7   18.1    55.1   64.2
nise-sph-ppr     19.0   22.6    49.7   58.7    12.8   18.1    57.4   65.2
nise-grc-fppr    17.6   21.7    46.7   57.1    12.2   17.6    51.1   61.4
nise-grc-ppr     17.6   22.0    47.3   56.0    12.8   17.6    53.5   62.4
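The average $F_\beta$ measure described in this subsection can be computed with a short routine. This is a sketch under the assumption that communities are given as Python sets of vertex identifiers; the function names are hypothetical, and a pair with empty overlap is assigned a score of zero to avoid dividing by zero.

```python
def f_beta(truth, found, beta):
    """F_beta of one ground-truth community against one algorithmic community."""
    overlap = len(truth & found)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)
    recall = overlap / len(truth)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_f_beta(truth_communities, found_communities, beta):
    """Match each ground-truth community to its best algorithmic community, then average."""
    best_scores = [max(f_beta(s, c, beta) for c in found_communities)
                   for s in truth_communities]
    return sum(best_scores) / len(truth_communities)


truth = [{1, 2, 3, 4}]
found = [{1, 2, 3, 5, 6, 7}, {8}]
print(average_f_beta(truth, found, 1), average_f_beta(truth, found, 2))
```

In this example the best match has precision 0.5 and recall 0.75, so the $F_2$ score is higher than the $F_1$ score, which illustrates how $F_2$ rewards recall.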
Table 6: Running times of different methods on our test networks.

Graph          oslom               demon             bigclam             nise-sph-fppr       nise-grc-fppr
HepPh          19 mins. 16 secs.   27 secs.          11 mins. 23 secs.   22 secs.            2 mins. 48 secs.
AstroPh        38 mins. 3 secs.    42 secs.          48 mins. 1 secs.    36 secs.            2 mins. 26 secs.
CondMat        20 mins. 39 secs.   50 secs.          7 mins. 21 secs.    36 secs.            1 min. 14 secs.
DBLP           5 hrs. 50 mins.     3 hrs. 53 mins.   7 hrs. 13 mins.     18 mins. 20 secs.   29 mins. 44 secs.
Amazon         2 hrs. 55 mins.     1 hr. 55 mins.    1 hr. 25 mins.      37 mins. 36 secs.   42 mins. 43 secs.
Flickr         N/A                 N/A               69 hrs. 59 mins.    43 mins. 55 secs.   3 hrs. 56 mins.
Orkut          N/A                 N/A               13 hrs. 48 mins.    1 hrs. 16 mins.     4 hrs. 16 mins.
LiveJournal    N/A                 N/A               65 hrs. 30 mins.    2 hrs. 36 mins.     4 hrs. 48 mins.
LiveJournal2   N/A                 N/A               21 hrs. 35 mins.    2 hrs. 15 mins.     6 hrs. 37 mins.
Myspace        N/A                 N/A               > 7 days            5 hrs. 27 mins.     9 hrs. 42 mins.

5.5 Comparison of Running Times 

Finally, we compare the running times of the different algorithms in Table 6. To do a fair comparison,
we run the single thread version of Bigclam and NISE for HepPh, AstroPh, CondMat, DBLP, and
Amazon networks. Since Demon and Oslom fail on larger networks, we use the multi-threaded version
of Bigclam and NISE with 4 threads for larger networks. We see that NISE is the only method which
can process the largest dataset (Myspace) in a reasonable time. On small networks (HepPh, AstroPh,
and CondMat), “nise-sph-fppr” is faster than Demon, Oslom, and Bigclam. On medium size networks
(DBLP and Amazon), both “nise-grc-fppr” and “nise-sph-fppr” are faster than other methods. On
large networks (Flickr, Orkut, LiveJournal, LiveJournal2, Myspace), NISE is much faster than Bigclam.

6 Discussion and Conclusion 

We now discuss the results from our experimental investigations. First, we note that NISE is the only
method that worked on all of the problems. Also, our method is faster than other state-of-the-art
overlapping community detection methods. Perhaps surprisingly, the major difference in cost between
using “graclus centers” for the seeds and the other seed choices does not result from the expense
of running Graclus. Rather, it arises because the personalized PageRank expansion technique takes
longer for the seeds chosen by Graclus. When the PageRank expansion method has a larger input
set, it tends to take longer, and the “graclus centers” seeding strategy is likely to produce larger input
sets because of the neighborhood inflation and because the central vertices of clusters are likely to be
high degree vertices.

We wish to address the relationship between our results and some prior observations on overlapping
communities. The authors of Bigclam found that the dense regions of a graph reflect areas of overlap
between overlapping communities. By using a conductance measure, we ought to find only these
dense regions; however, our method produces much larger communities that cover the entire graph.
The reason for this difference is that we use the entire vertex neighborhood as the restart for the
personalized PageRank expansion routine. We avoid seeding exclusively inside a dense region by
using an entire vertex neighborhood as a seed, which grows the set beyond the dense region. Thus,
the communities we find likely capture a combination of communities given by the ego network of the
original seed node.

Overall, NISE significantly outperforms other state-of-the-art overlapping community detection
methods in terms of runtime, conductance-vs-coverage, and ground-truth accuracy. Also, our new
seeding strategies, “graclus centers” and “spread hubs”, are superior to existing methods, and thus play
an important role in the success of our seed set expansion method.